Update robots.txt #2632

Merged
merged 1 commit into from
Aug 18, 2019
Conversation

dsofeir
Contributor

@dsofeir dsofeir commented Aug 18, 2019

I have found that Bing/Yahoo/DuckDuckGo, Yandex and Google report crawl errors when using the default robots.txt. Specifically, their bots will not crawl the path '/' or any sub-paths. I agree that the current robots.txt should work and that it properly implements the specification. In practice, however, it does not work.

In my experience, explicitly permitting the path '/' by adding the directive Allow: / resolves the issue.
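As a rough illustration of why the existing file "should work" on paper, the sketch below feeds two rule sets to Python's standard `urllib.robotparser` and compares the verdicts with and without the extra `Allow: /` line. The `Disallow` paths and the `is_allowed` helper are assumptions made up for this example; they are not the project's actual default robots.txt. Note also that Python's parser applies the first matching rule in file order, which is not necessarily how each search engine resolves conflicts, so this only checks the rules against one reference implementation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT the actual default robots.txt.
WITHOUT_ALLOW = """\
User-agent: *
Disallow: /cache/
Disallow: /logs/
"""

# The same rules with the explicit directive proposed in this pull request.
WITH_ALLOW = WITHOUT_ALLOW + "Allow: /\n"


def is_allowed(robots_txt, path, agent="*"):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


for label, rules in (("without Allow", WITHOUT_ALLOW), ("with Allow", WITH_ALLOW)):
    for path in ("/", "/blog/post", "/cache/page.html"):
        print(f"{label:14} {path:18} allowed={is_allowed(rules, path)}")
```

Under this parser both variants give the same answers, which matches the observation that the explicit `Allow: /` is redundant according to the specification even though some crawlers appear to behave differently without it.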

More details can be found in a blog post about the issue here: https://www.dfoley.ie/blog/starting-with-the-indieweb

@rhukster rhukster merged commit ed87faa into getgrav:develop Aug 18, 2019
@kinger-de

kinger-de commented Oct 22, 2019

I have a similar problem with the robots.txt. I have published a page with an image under '/user/pages/01.page/01._module/default.md' and '/user/pages/01.page/01._module/default.jpg'. The image URL in the HTML representation is 'domain.tld/user/pages/01.page/01._module/default.jpg'.

Googlebot can crawl all pages without any problems, but it does not index the image. I tested the image URL with the Search Console and got the message that the image can't be indexed because it is blocked by robots.txt. If I test the same URL with the Google robots.txt tester, everything looks fine: the rule 'Allow: /user/pages/' is highlighted and the response is 200.

I've also tested it with the 'Allow: /' rule, without success. I've also tried placing the allow rules before the disallow rules. Nothing helped.

Every robots.txt tester says the robots.txt is fine and the image URL is not blocked, except the Google Search Console. Any hints on how I can get the image into the index?

@hughbris
Contributor

Does anyone know a way to report these clearly identified errors to these providers? All I know is that I've concluded it's futile to try to communicate with Google about their products. Even when they supposedly have channels open, nothing happens, not even an acknowledgement of the message. Not sure about the others.

@kinger-de

I think Google is just listening. Whether anything will be changed in their products, and what, will probably be decided elsewhere. As I understand it, the robots.txt tester is not designed to check the indexing of images, and the feedback from the tool is not generally valid.

For my problem, I have now created an extra sitemap for images. That helped.
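For anyone wanting to try the same workaround, here is a minimal sketch of how such an image sitemap could be generated with Python's standard library. It is not the sitemap actually used above. The `build_image_sitemap` helper, the `https://` scheme and the page URL `https://domain.tld/page` are assumptions for illustration; only the image path comes from the earlier comment. The two namespaces are the standard sitemap namespace and Google's image sitemap extension.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """Build sitemap XML from a mapping of page URL -> list of image URLs."""
    # Register prefixes so the output uses xmlns="..." and xmlns:image="...".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            # Each image on the page gets its own <image:image><image:loc>.
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page URL; the image path is the one quoted earlier in this thread.
sitemap_xml = build_image_sitemap({
    "https://domain.tld/page": [
        "https://domain.tld/user/pages/01.page/01._module/default.jpg",
    ],
})
print(sitemap_xml)
```

The resulting file would still need to be saved somewhere crawlable and submitted in the Search Console or referenced from robots.txt with a `Sitemap:` line.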


4 participants